Automatic Image Selection Model Based on Machine Learning for Endobronchial Ultrasound Strain Elastography Videos

Background Endobronchial ultrasound (EBUS) strain elastography can help differentiate benign from malignant intrathoracic lymph nodes (LNs) by reflecting the relative stiffness of tissues. Because image selection and interpretation are highly subjective, the diagnostic efficiency of strain elastography is difficult to realize fully. This study aimed to use machine learning to automatically select high-quality and stable representative images from EBUS strain elastography videos. Methods LNs with qualified strain elastography videos recorded from June 2019 to November 2019 were randomly assigned to the training and validation sets at a ratio of 3:1 to train an automatic image selection model using a machine learning algorithm. The strain elastography videos recorded in December 2019 were used as the test set, from which the model selected three representative images for each LN. Meanwhile, three experts and three trainees each selected one representative image for each LN in the test set. A qualitative grading score and four quantitative methods were used to evaluate these images and to assess the performance of the automatic image selection model. Results A total of 415 LNs were included in the training and validation sets and 91 LNs in the test set. The qualitative grading score showed no statistical difference among the three images selected by the machine learning model. The coefficient of variation (CV) values of the four quantitative methods in the machine learning group were all lower than the corresponding CV values in the expert and trainee groups, which demonstrated the stability of the machine learning model. Diagnostic performance analysis of the four quantitative methods showed that the diagnostic accuracies ranged from 70.33% to 73.63% in the trainee group, 78.02% to 83.52% in the machine learning group, and 80.22% to 82.42% in the expert group. Moreover, there were no statistical differences in the corresponding mean values of the four quantitative methods between the machine learning and expert groups (p >0.05). Conclusion The automatic image selection model established in this study can help select stable and high-quality representative images from EBUS strain elastography videos and has great potential in the diagnosis of intrathoracic LNs.


INTRODUCTION
The differential diagnosis of malignant and benign intrathoracic lymph nodes (LNs) is an important clinical problem that bears on the diagnosis and prognosis of intrathoracic diseases. Needle techniques are recommended over surgical examination as the first choice for obtaining tissue (1,2). Endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA) is an important minimally invasive tool for differentiating benign from malignant intrathoracic LNs.
Previous studies have suggested that ultrasonographic features can be used to predict benign or malignant diagnoses in patients undergoing EBUS-TBNA (3). EBUS imaging includes three modes: grayscale, blood flow Doppler and strain elastography. Studies have indicated that strain elastography has the best diagnostic value among the three modes (4,5). Elastography has been widely used for breast, thyroid, pancreas, prostate and liver lesions and in endoscopic ultrasound (6-11). By exerting repeated, slight pressure on the examined lesion, elastography quantifies tissue elasticity by measuring the deformation and presents it as a color map (12-14). Colors from yellow/red through green to blue represent tissues of increasing relative stiffness (13). Malignant infiltration by tumor cells can alter cell morphology and the overall histology of tissues, resulting in stiffer tissue. Elastography can therefore indirectly predict malignant lesions by reflecting their relative stiffness (15). EBUS strain elastography plays an important role in differentiating benign from malignant intrathoracic LNs (16). During EBUS-TBNA, the bronchoscopist can use strain elastography to select the target LN and possible metastatic sites within the LN for biopsy (17,18).
For qualitative analysis of strain elastography images, the five-score grading method provides a specific classification; when scores 1-3 were defined as benign and scores 4-5 as malignant, the diagnostic accuracy for predicting malignant LNs reached 83.32% (4). Quantitative methods, including stiff area ratio (SAR), elasticity ratio of blue/green (B/G), mean hue value and mean gray value, can comprehensively evaluate elastography images (4,5,18-23). Qualitative methods are more convenient for clinical application, but they are inevitably subjective, so doctors with different levels of experience may judge the same strain elastography video differently. Although the quantitative methods are relatively objective, the images used for quantitative analysis are still selected subjectively. Moreover, ultrasound imaging is the specialty of ultrasound doctors, and endoscopists may not make full use of strain elastography because of limited experience.
In recent years, with the development of machine learning algorithms, machine learning has performed well in many areas of medical imaging, such as skin cancer, retinal fundus photographs, gastrointestinal endoscopy and chest CT (24-27). By extracting multiple quantitative image features that may be difficult for doctors to observe, machine learning can assign a likelihood to each case and classify images accurately. Research has demonstrated that the performance of machine learning combined with colorectal endoscopy in diagnosing colorectal lesions was comparable to that of experts (28). In the field of bronchoscopy, a computer-assisted diagnosis (CAD) system has been used to classify normal mucosa, chronic bronchitis and lung tumors under white-light bronchoscopy, achieving a classification rate of 80% (29). In addition, a machine learning texture model achieved an accuracy of 86% in classifying cancer subtypes from bronchoscopic findings (30). However, there are few applications of machine learning to EBUS strain elastography. Therefore, the purpose of this study was to establish a machine learning model that automatically selects representative images from strain elastography videos.

MATERIALS AND METHODS

Patients and LNs
Patients who underwent EBUS-TBNA examination in Shanghai Chest Hospital from June to December 2019 and met the following criteria were enrolled in this study: (1) at least one enlarged intrathoracic LN (short axis >1 cm) on computed tomography (CT), or at least one LN with positive 18F-FDG uptake (standardized uptake value >2.5) on positron emission tomography; (2) pathological confirmation was clinically required and EBUS-TBNA examination was feasible; (3) no contraindications to EBUS-TBNA and signed informed consent. LNs with qualified strain elastography videos were analyzed in the study. LNs examined in December were used as the test set to assess the automatic representative image selection model, and the remainder were used as the training set and validation set. This study was approved by the local Ethics Committee of Shanghai Chest Hospital (No. KS1947) and registered at ClinicalTrials.gov PRS (NCT04328792).

EBUS Strain Elastography Procedure
LNs were examined with an ultrasound bronchoscope (BF-UC260FW, Olympus, Tokyo, Japan), and EBUS strain elastography videos were recorded with an ultrasound processor (EU-ME2, Olympus, Tokyo, Japan). The operator located the target LN and measured its EBUS size at the maximal cross-section in grayscale mode. After observing the grayscale and blood flow Doppler modes, the operator switched to strain elastography mode, in which the elastography image was generally formed through the patient's respiration, cardiac impulse and blood vessel pulsation. When imaging was unsatisfactory, the operator exerted appropriate pressure on the target LN by pressing the up-down angle lever of the bronchoscope at a frequency of three to five times per second to obtain better imaging. The maximal cross-section of the LN was recorded and two 20-second videos were saved (4). Subsequently, EBUS-TBNA was performed to obtain cytological specimens for pathological examination. All operators recorded strain elastography videos and sampled LNs according to the above standard steps. The final diagnosis of LNs was based on EBUS-TBNA, thoracoscopy, mediastinoscopy, transthoracic thoracotomy or other pathological examinations, microbiological examination, or clinical follow-up for more than one year.

Development of Automatic Representative Images Selection Model for Strain Elastography Videos
The training set and validation set were randomly divided at a ratio of 3:1 to train the model with optimal hyperparameters. The same proportion of benign and malignant LNs was maintained in the two datasets. We developed models with various hyperparameter values on the training set and assessed these models on the validation set to choose the hyperparameters according to performance. Once the hyperparameters were determined, we used both the training set and the validation set to train the model for prediction and evaluation on the test set. In this paper, the hyperparameters were the number of representative patterns and whether or not the update-and-predict strategy was adopted. The candidate numbers of representative patterns were 32, 64 and 128. Blind to the final diagnosis of LNs, two experts, each with experience of observing EBUS images of >500 LNs, jointly assessed the image quality on the validation set as follows: score 1 (scattered soft, mixed green-yellow-red), score 2 (homogeneous soft, predominantly green), score 3 (intermediate, mixed blue-green-yellow-red), score 4 (scattered hard, mixed blue-green), score 5 (homogeneous hard, predominantly blue). Scores 1-3 were classified as benign and scores 4-5 as malignant (4). Four quantitative methods were also used to verify the diagnostic performance on the validation set. Assessments on the validation set showed that the highest accuracy was obtained when adopting the update-and-predict (run-twice) strategy and using 64 representative patterns (Supplementary Tables 1 and 2). After training the model with the determined hyperparameters, we used it to make predictions on the test set. Note that the test set was not used in the training phase. Figure 1 illustrates the process of selecting representative strain elastography images with the proposed machine learning algorithm. Initially, the elastography video was converted into a sequence of frames, and the quality of each frame was evaluated.
According to the proportion of colored pixels and the relative intensity of a frame (Supplementary Material and Supplementary Figure 1), the original frames were divided into qualified and unqualified frames, and only qualified frames were kept for the subsequent procedures. Additionally, to avoid an excessive number of qualified frames and to reduce complexity, the two frames adjacent to each selected qualified frame were dropped. Then, feature engineering was performed on the remaining frames. We constructed the features of each frame from a 512-bin color histogram that describes the color distribution of elastography images (31). Further, the principal component analysis (PCA) algorithm was applied to reduce the feature dimension, and a 40-dimension feature space was obtained. The number of dimensions depended on the training set, and in this study 40 dimensions retained 99% of the variance. Clustering plays an important role in video analysis (32-35). Because experts select the most repeatable pattern across the video as the representative frame, we employed the k-means clustering algorithm in this study. In the training phase, k-means clustering was performed on the training set to obtain representative patterns (cluster centers). In the prediction phase, the frame features from the test video were assigned to the patterns extracted from the training set. Given a test video, the pattern owning the most frames was regarded as the representative pattern, and the three frames closest to the representative pattern were selected as the representative images.
FIGURE 1 | The process of automatic selection of representative images. Frames were first extracted from the video stream to construct a frame pool. Inferior frames were then dropped during the quality evaluation procedure, and the eligible frames were kept as candidates for representative images. Next, PCA was employed for dimension reduction. Finally, the clustering model selected representative images from the candidates. PCA, principal component analysis.
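The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: it assumes that the 512-bin histogram uses 8 bins per RGB channel, that distances are Euclidean, and that the three selected frames are drawn only from frames assigned to the dominant pattern. All function names are hypothetical, the PCA step is omitted, and the toy 2-D features stand in for PCA-reduced histograms.

```python
import math
from collections import Counter


def color_histogram(pixels, bins_per_channel=8):
    """Normalized joint RGB histogram; 8 bins per channel gives 8**3 = 512 bins (assumed layout)."""
    width = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        index = ((r // width) * bins_per_channel + (g // width)) * bins_per_channel + (b // width)
        hist[index] += 1.0
    return [count / len(pixels) for count in hist]


def nearest_center(feature, centers):
    """Index of the cluster center (pattern) closest to a frame feature."""
    return min(range(len(centers)), key=lambda k: math.dist(feature, centers[k]))


def select_representative_frames(features, centers, n_select=3):
    """Pick the pattern that owns the most frames, then its n_select closest frames."""
    assignments = [nearest_center(f, centers) for f in features]
    top_pattern = Counter(assignments).most_common(1)[0][0]
    members = [i for i, k in enumerate(assignments) if k == top_pattern]
    members.sort(key=lambda i: math.dist(features[i], centers[top_pattern]))
    return top_pattern, members[:n_select]


# Toy example: five frames in a 2-D feature space and two fixed patterns.
toy_features = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (0.05, 0.05)]
toy_centers = [(0.0, 0.0), (5.0, 5.0)]
pattern, frames = select_representative_frames(toy_features, toy_centers)
print(pattern, frames)
```

In this toy run, four of the five frames fall on the first pattern, so that pattern is treated as representative and its three nearest frames are returned as indices into the frame list.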
In real-world applications, however, it is hard to collect a training set with enough examples to cover all possible situations and guarantee the generalization ability of the trained model. Consequently, a limited training set usually leads to a performance gap when the model is applied to real data. To narrow this gap, we proposed an update-and-predict strategy for the prediction phase that ran the trained model twice on the test set. The first run produced the initial predictions for the test videos, which were used to update the cluster centers in the model. Subsequently, the updated model was used to obtain the final predictions on the test set. Note that k-means clustering is an unsupervised learning algorithm that does not require manual annotation or ground truth. We therefore used k-means clustering to update the model with only the predictions for the test videos, without accessing their labels. As a result, the update-and-predict strategy narrowed the gap between the training set and the test set without leaking label (supervision) information in the prediction phase.
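A minimal sketch of the update-and-predict idea follows. The text does not specify the exact update rule, so this sketch assumes a single k-means-style step in which each center moves to the mean of the unlabeled test features assigned to it, and centers that receive no test frames are left unchanged. The function names and toy values are hypothetical.

```python
import math


def nearest_center(feature, centers):
    """Index of the cluster center closest to a frame feature (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(feature, centers[k]))


def update_centers(features, centers):
    """One k-means-style update that uses only unlabeled test features."""
    groups = [[] for _ in centers]
    for feature in features:
        groups[nearest_center(feature, centers)].append(feature)
    updated = []
    for center, members in zip(centers, groups):
        if members:
            # Move the center to the mean of the frames assigned to it.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            # Keep patterns that received no test frames (assumed behavior).
            updated.append(tuple(center))
    return updated


def update_and_predict(features, centers):
    """Run 1 updates the centers; run 2 assigns frames with the updated centers."""
    updated = update_centers(features, centers)
    return updated, [nearest_center(f, updated) for f in features]


trained_centers = [(0.0, 0.0), (10.0, 10.0)]
test_features = [(1.0, 1.0), (3.0, 1.0), (9.0, 9.0)]
new_centers, final_assignments = update_and_predict(test_features, trained_centers)
print(new_centers, final_assignments)
```

No diagnosis label appears anywhere in the update, which is the property the strategy relies on to avoid label leakage.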

Evaluation of Representative Images
For the three images selected by the automatic image selection model on the test set, the same two experts jointly evaluated the grading score. An expert group and a trainee group (experience of CP-EBUS image observation of fewer than 30 LNs) selected representative images for comparison with those selected by machine learning. Each of the three experts reviewed the two elastography videos of each LN and selected one representative image for qualitative evaluation. Qualified images had to cover the maximal cross-section of the target LN and have good repeatability (4). Three trainees selected representative frames and evaluated the qualitative score of the corresponding images in the same way. Quantitative measurement of the three groups of images was performed with an elastography quantitative system (Registration number: 2015SR191866) developed in MATLAB, and the region of interest was outlined by an expert (Figure 2). The program output the results of four quantitative methods: SAR, B/G, mean hue value and mean gray value. The first method, SAR, was the ratio of blue pixels to all pixels of the LN (5,18-20). RGB is a color space model that represents colors by red, green and blue channels, and B/G was calculated from it in this study (21). Hue histogram analysis was performed on the selected images, and the third method, mean hue value, corresponds to the global elasticity of the LN (22). The fourth method, mean gray value, has been studied in the diagnosis of breast cancer and intrathoracic LNs (4,23). All of the above procedures were carried out by experts and trainees blinded to the clinical information and pathological results of the target LNs.
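To make the four quantitative indicators concrete, the sketch below computes them for a list of RGB pixels inside a region of interest. It is not the registered MATLAB system, and the definitions are assumptions made for illustration: a pixel counts as blue when its hue lies between 200 and 260 degrees with saturation of at least 0.2, B/G is taken as the ratio of summed blue to summed green channel values, mean hue is a simple arithmetic mean in degrees, and gray values use the common 0.299/0.587/0.114 luminance weights.

```python
import colorsys


def elastography_metrics(roi_pixels, blue_hue_range=(200.0, 260.0), min_saturation=0.2):
    """Return SAR, B/G, mean hue (degrees) and mean gray (0-255) for ROI pixels.

    All thresholds and formulas here are illustrative assumptions.
    """
    blue_count = 0
    sum_blue = 0
    sum_green = 0
    hue_total = 0.0
    gray_total = 0.0
    for r, g, b in roi_pixels:
        h, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hue = h * 360.0
        hue_total += hue
        gray_total += 0.299 * r + 0.587 * g + 0.114 * b
        sum_blue += b
        sum_green += g
        if s >= min_saturation and blue_hue_range[0] <= hue <= blue_hue_range[1]:
            blue_count += 1
    n = len(roi_pixels)
    return {
        "SAR": blue_count / n,  # share of stiff (blue) pixels in the ROI
        "B/G": sum_blue / sum_green if sum_green else None,
        "mean_hue": hue_total / n,
        "mean_gray": gray_total / n,
    }


# Toy ROI: two pure blue pixels, one pure green and one pure red.
toy_roi = [(0, 0, 255), (0, 0, 255), (0, 255, 0), (255, 0, 0)]
metrics = elastography_metrics(toy_roi)
print(metrics)
```

Because hue is a circular quantity, an arithmetic mean is only a rough summary when red pixels near 0 and 360 degrees are present; a production system would need to handle this explicitly.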

Statistical Analysis
For the qualitative score, the Friedman test was used to assess differences among the three images selected by the automatic image selection model and by the experts, and the Wilcoxon signed-rank test was used for pairwise comparison. For quantitative variables, receiver operating characteristic (ROC) curves were used to obtain the area under the curve (AUC) and the cut-off value with the best diagnostic performance. The paired t-test was used to compare quantitative mean values between images selected by the machine learning model and by the experts. The stability of the quantitative results within the three groups was evaluated using the coefficient of variation (CV), and the CV values of the three groups were further compared by the paired t-test. A p value <0.05 was considered statistically significant for the above analyses. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy for differentiating benign and malignant LNs were calculated by the corresponding formulas. All statistical analyses were performed with SPSS version 25.0 (IBM Corp., Armonk, NY, USA).
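The descriptive quantities named above can be written out as follows. This is an illustrative sketch, not the SPSS procedure used in the study: the CV uses the sample standard deviation, the cut-off is chosen by the Youden index (one common reading of "best diagnostic performance", assumed here), higher values are assumed to indicate malignancy, and all counts and scores are toy values.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def diagnostic_metrics(tp, fp, tn, fn):
    """Standard formulas from a 2x2 table (malignant = positive)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def best_cutoff(scores, labels):
    """Cut-off maximizing the Youden index J = sensitivity + specificity - 1.

    A case is called malignant when its score is >= the cut-off.
    Both classes must be present in labels (1 = malignant, 0 = benign).
    """
    best = None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if best is None or j > best[1]:
            best = (cut, j)
    return best


# Toy values for illustration only.
cv = coefficient_of_variation([1.0, 2.0, 3.0])
table = diagnostic_metrics(tp=8, fp=2, tn=7, fn=3)
cutoff, youden = best_cutoff([0.1, 0.2, 0.6, 0.7], [0, 0, 1, 1])
print(cv, table["accuracy"], cutoff, youden)
```

In the study design, a CV would be computed per LN across the images chosen within one group, so a lower CV indicates more consistent image selection.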

RESULTS

Patients and LNs
A total of 415 LNs from 351 patients (247 men, 104 women; age: 60.45 ± 11.31 years) were analyzed in the training and validation sets, and 91 LNs from 73 patients (52 men, 21 women; age: 58.82 ± 10.95 years) were used as the test set (Table 1). The training set included 311 LNs and the validation set 104 LNs. Malignant LNs accounted for 61.69% of the training and validation sets and 58.24% of the test set. Figure 3 displays the representative images selected by the machine learning, expert and trainee groups.
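The reported set sizes are consistent with a class-stratified 3:1 split, which can be sketched as below. The sketch is an illustration under stated assumptions, not the authors' code: the counts of 256 malignant and 159 benign LNs are inferred here from the reported 61.69% of 415 and are not given in the text, and the function name, seed and per-class rounding rule are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split indices into training and validation sets, keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_train = round(len(members) * train_fraction)  # 3:1 ratio within each class
        train_idx.extend(members[:n_train])
        val_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(val_idx)


# 1 = malignant, 0 = benign; counts inferred from the reported percentage.
labels = [1] * 256 + [0] * 159
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

With these inferred counts, splitting each class separately yields 311 training and 104 validation LNs, matching the sizes reported above.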

Stability and Diagnostic Performance Analysis by the Qualitative Grading Score
To evaluate the stability of the images selected by machine learning, we analyzed the differences among the three images within the machine learning, expert and trainee groups, respectively. The results demonstrated a statistical difference in the expert group, while the images of the machine learning and trainee groups were relatively stable (Table 2). In addition, Table 3 showed that the diagnostic accuracies of the three images in the machine learning group were 82.42%, 79.12% and 75.82%, respectively, slightly lower than those of the experts (p = 0.121) but significantly higher than those of the trainee group (p <0.001).

Stability Analysis by Quantitative Methods
To assess the images selected by machine learning more objectively, a paired t-test was conducted on the quantitative results of the machine learning and expert groups. No statistical difference was found between the two groups, which demonstrated that images selected by machine learning can reach the expert level (Table 4). Regarding the stability of the images within and between the three groups, Table 5 showed that the CV values of the machine learning group were lower than those of the expert and trainee groups for each indicator, and the trainee group had the highest CV values. In addition, for SAR and B/G, there were statistical differences between the machine learning group and the other two groups, indicating that the images selected by machine learning in the test set were more stable than those selected by the expert and trainee groups (Table 5).

Diagnostic Performance Analysis by Quantitative Methods
ROC curves were plotted for the quantitative results of the three groups, and the cut-off values with the best diagnostic performance were determined (Figure 4). Table 6 showed that the accuracies of the four quantitative methods (SAR, B/G, mean hue value and mean gray value) in the machine learning group were 83.52%, 78.02%, 80.22% and 80.22%, respectively. The corresponding accuracies in the expert group were 80.22%, 81.32%, 82.42% and 82.42%. By contrast, the best indicator in the trainee group was B/G, with an accuracy of only 73.63%.

DISCUSSION
Lung cancer is the leading cause of cancer-associated morbidity and mortality around the world (36). Pulmonary diseases can be diagnosed through their draining LNs; therefore, the diagnosis of intrathoracic LNs is related to subsequent treatment strategies. EBUS strain elastography imaging is a useful noninvasive tool for differentiating benign from malignant LNs. In this study, a machine learning algorithm was used to automatically select representative images from EBUS strain elastography videos, and the image quality was equivalent to the expert level. Traditional qualitative methods are convenient for clinical application, but subjectivity and differences in experience between doctors can affect diagnostic accuracy. Images used for quantitative analysis are still manually selected, which cannot avoid subjectivity. The CV values in Table 5 reflect the instability of manual selection, and the images selected by doctors with different levels of experience varied in quality. For the qualitative results, there was a statistical difference among the images selected by experts (p = 0.036) but not among those selected by trainees (Table 2). However, the diagnostic accuracies differed more among trainees than among experts. This was because the diagnostic performance was calculated on a dichotomy, in which scores 1-3 were classified as benign and scores 4-5 as malignant, whereas the differences in qualitative score were assessed across the five categories. In addition, among the three groups, the qualitative diagnostic performance of the expert group was the highest in the whole study. However, the quantitative results of the expert group were similar to those of the machine learning group, possibly because of the subjectivity of qualitative assessment among different experts. Compared with the qualitative results, the quantitative methods can evaluate the quality of the images selected by machine learning more objectively.
Elastography can only reflect the relative hardness of the target lesion; fibrosis within sarcoidosis may result in stiffer tissue, and necrosis within malignant LNs may lead to softer lesions (37,38). Thus, the highest diagnostic accuracy of the automatic image selection model by qualitative and quantitative methods reached only 83.52%, which was due not only to inaccurate image selection but also to the properties of the lesions themselves. In addition to the four quantitative methods used in this study, strain ratio and strain histogram are also quantitative methods, and one study found that strain histogram showed better predictive value than strain ratio, with a diagnostic rate of 82% in predicting malignant LNs (39). Different quantitative methods can thus lead to different diagnostic results, and there is currently no unified quantitative method. In this study, the four methods produced different results in the three groups, but the quality of the images had more effect on the final results than the choice of quantitative method. Notably, the machine learning algorithm in this study was valid for selecting representative images from EBUS strain elastography videos, but it would need to be integrated by the manufacturer to become clinically applicable. This study still had some limitations. Since there was no restriction on the type of disease included, the machine learning model was only suitable for the diagnosis of enlarged intrathoracic LNs, and further studies are needed to determine whether this technique is valid for lung cancer staging. In addition, although high-quality images were selected from elastography videos, the model made no diagnosis from these images, and the grayscale and blood flow Doppler modes of EBUS were not used. Automatic multimodal EBUS image selection and diagnosis may be more convenient for clinical application.
Moreover, this was a single-center retrospective study with a limited number of LNs, and some diseases accounted for limited proportions. Prospective studies with more LNs to train, validate and test the model may yield more stable models and more convincing results. Thus, it is worthwhile to carry out multi-center studies to improve the outcome of the model (40).
In conclusion, by applying a machine learning algorithm to EBUS strain elastography, we achieved automatic selection of high-quality and stable images from strain elastography videos. The automatic image selection model needs further prospective clinical validation and has potential value in guiding the diagnosis of intrathoracic LNs.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Ethics Committee of Shanghai Chest Hospital. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements. Written informed consent was not obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

AUTHOR CONTRIBUTIONS
XZ collected videos, selected representative images, conducted qualitative and quantitative analysis of LNs, and performed statistical analysis. JL designed the machine learning model for automatic selection of representative images. JC and XZ evaluated images selected by machine learning. FX, LW, and JS selected representative images and scored them qualitatively as the expert group. JS and WD designed the study and reviewed the manuscript. JS and HX supported this study. All authors contributed to the article and approved the submitted version.